The Goal

In this project I am going to test different statistical learning methods in order to predict the price per night of an Airbnb apartment in the city of Barcelona. For this purpose I will use the following methods: Ridge Regression, Lasso Regression, Bagging, Random Forest and Gradient Boosting.

The Data

The data to train the model has been obtained from this project. In this website we can find a dataset with data related to airbnb apartments in several cities around the world. Each dataset is referred to a single city, and since I am from Barcelona and also because it was one of the cities with more data, I have decided to try and predict the price per night in my home city.

Since the dataset included a lot of variables, I have decided to load only those that seem more relevant

library(tidyverse)

setwd("C:/Users/marct/OneDrive - Tecnocampus Mataro-Maresme/Documentos/CURSOS/PROJECTES/AIRBNB PRICE PREDICTION/DATA")

airbnb_data <- read.csv("./barcelona.csv") %>% select(c("host_since", "host_response_time", 
                                                             "host_response_rate", "host_acceptance_rate", "host_is_superhost", 
                                                             "host_neighbourhood", "host_listings_count", "host_total_listings_count",
                                                             "host_verifications", "host_has_profile_pic", "host_identity_verified",
                                                             "neighbourhood", "city", "state",
                                                             "smart_location", "country", "latitude", "longitude",
                                                             "is_location_exact", "property_type", "room_type", "accommodates", "bathrooms", 
                                                             "bedrooms", "beds", "bed_type", "amenities", "square_feet", "price", "security_deposit",
                                                             "cleaning_fee", "guests_included", "extra_people", "minimum_nights", "maximum_maximum_nights",
                                                             "has_availability", "availability_30", "availability_60", "availability_90", "availability_365",
                                                             "number_of_reviews", "first_review", "last_review", "review_scores_rating", "review_scores_accuracy",
                                                             "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication", "review_scores_value",
                                                             "requires_license", "instant_bookable", "is_business_travel_ready", "cancellation_policy",
                                                             "require_guest_profile_picture", "require_guest_phone_verification", "calculated_host_listings_count",
                                                             "calculated_host_listings_count_entire_homes", "calculated_host_listings_count_private_rooms",
                                                             "calculated_host_listings_count_shared_rooms", "reviews_per_month"))

Airbnb Barcelona Dataset

This dataset contains 20864 rows (observations) and 60 columns, and as we can already see, some variables have very messy data, and this will require a lot of data wrangling.

str(airbnb_data)
## 'data.frame':    20864 obs. of  60 variables:
##  $ host_since                                  : chr  "2010-01-24" "2010-03-09" "2010-01-24" "2010-01-24" ...
##  $ host_response_time                          : chr  "within a few hours" "within an hour" "within a few hours" "within a few hours" ...
##  $ host_response_rate                          : chr  "100%" "100%" "100%" "100%" ...
##  $ host_acceptance_rate                        : chr  "91%" "100%" "91%" "91%" ...
##  $ host_is_superhost                           : chr  "f" "t" "f" "f" ...
##  $ host_neighbourhood                          : chr  "El Gòtic" "El Besòs i el Maresme" "El Gòtic" "El Gòtic" ...
##  $ host_listings_count                         : int  3 6 3 3 4 4 4 4 1 7 ...
##  $ host_total_listings_count                   : int  3 6 3 3 4 4 4 4 1 7 ...
##  $ host_verifications                          : chr  "['email', 'phone', 'reviews', 'manual_offline', 'jumio', 'offline_government_id', 'government_id', 'work_email']" "['email', 'phone', 'reviews', 'jumio', 'offline_government_id', 'selfie', 'government_id', 'identity_manual']" "['email', 'phone', 'reviews', 'manual_offline', 'jumio', 'offline_government_id', 'government_id', 'work_email']" "['email', 'phone', 'reviews', 'manual_offline', 'jumio', 'offline_government_id', 'government_id', 'work_email']" ...
##  $ host_has_profile_pic                        : chr  "t" "t" "t" "t" ...
##  $ host_identity_verified                      : chr  "t" "t" "t" "t" ...
##  $ neighbourhood                               : chr  "El Gòtic" "Sant Martí" "El Gòtic" "Ciutat Vella" ...
##  $ city                                        : chr  "Barcelona" "Sant Adria de Besos" "Barcelona" "Barcelona" ...
##  $ state                                       : chr  "CT" "Barcelona" "CT" "Catalonia" ...
##  $ smart_location                              : chr  "Barcelona, Spain" "Sant Adria de Besos, Spain" "Barcelona, Spain" "Barcelona, Spain" ...
##  $ country                                     : chr  "Spain" "Spain" "Spain" "Spain" ...
##  $ latitude                                    : num  41.4 41.4 41.4 41.4 41.4 ...
##  $ longitude                                   : num  2.18 2.22 2.18 2.18 2.15 ...
##  $ is_location_exact                           : chr  "t" "f" "t" "t" ...
##  $ property_type                               : chr  "Apartment" "Apartment" "Apartment" "Apartment" ...
##  $ room_type                                   : chr  "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ accommodates                                : int  2 6 2 9 2 1 1 2 1 4 ...
##  $ bathrooms                                   : num  1 2 1 3 2 2 2 2 1 1 ...
##  $ bedrooms                                    : int  1 3 1 4 1 1 1 1 1 1 ...
##  $ beds                                        : int  1 5 1 6 1 1 1 1 1 2 ...
##  $ bed_type                                    : chr  "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
##  $ amenities                                   : chr  "{TV,\"Cable TV\",Internet,Wifi,\"Air conditioning\",Kitchen,Elevator,Heating,\"Family/kid friendly\",Washer,Dry"| __truncated__ "{TV,Internet,Wifi,\"Wheelchair accessible\",Kitchen,\"Paid parking off premises\",Elevator,\"Buzzer/wireless in"| __truncated__ "{TV,\"Cable TV\",Internet,Wifi,\"Air conditioning\",Kitchen,Elevator,Heating,\"Family/kid friendly\",Microwave,"| __truncated__ "{TV,\"Cable TV\",Internet,Wifi,\"Air conditioning\",Kitchen,\"Paid parking off premises\",Elevator,\"Buzzer/wir"| __truncated__ ...
##  $ square_feet                                 : int  NA NA NA NA 807 807 807 807 NA NA ...
##  $ price                                       : chr  "$80.00" "$220.00" "$100.00" "$227.00" ...
##  $ security_deposit                            : chr  "$100.00" "$300.00" "$150.00" "$200.00" ...
##  $ cleaning_fee                                : chr  "$20.00" "$80.00" "$40.00" "$67.00" ...
##  $ guests_included                             : int  2 3 1 4 1 1 1 1 1 2 ...
##  $ extra_people                                : chr  "$0.00" "$10.00" "$0.00" "$25.00" ...
##  $ minimum_nights                              : int  3 3 5 4 7 2 2 2 2 2 ...
##  $ maximum_maximum_nights                      : int  90 1125 120 1125 1125 730 1125 1125 65 364 ...
##  $ has_availability                            : chr  "t" "t" "t" "t" ...
##  $ availability_30                             : int  16 30 30 30 26 14 21 17 29 30 ...
##  $ availability_60                             : int  16 57 60 60 52 31 44 43 59 60 ...
##  $ availability_90                             : int  16 83 90 90 75 46 70 69 83 80 ...
##  $ availability_365                            : int  88 322 180 180 348 318 345 344 358 336 ...
##  $ number_of_reviews                           : int  2 52 8 149 303 238 258 222 73 339 ...
##  $ first_review                                : chr  "2017-05-16" "2011-03-15" "2010-07-10" "2010-10-03" ...
##  $ last_review                                 : chr  "2017-11-06" "2019-12-15" "2013-07-15" "2020-03-11" ...
##  $ review_scores_rating                        : int  100 95 68 91 94 95 96 95 94 94 ...
##  $ review_scores_accuracy                      : int  10 10 8 10 10 10 10 10 10 10 ...
##  $ review_scores_cleanliness                   : int  10 10 8 9 9 10 10 9 10 10 ...
##  $ review_scores_checkin                       : int  10 10 7 10 10 10 10 10 10 9 ...
##  $ review_scores_communication                 : int  10 10 9 10 10 10 10 10 10 10 ...
##  $ review_scores_value                         : int  10 9 7 9 10 10 9 9 10 10 ...
##  $ requires_license                            : chr  "t" "t" "t" "t" ...
##  $ instant_bookable                            : chr  "f" "t" "f" "t" ...
##  $ is_business_travel_ready                    : chr  "f" "f" "f" "f" ...
##  $ cancellation_policy                         : chr  "moderate" "strict_14_with_grace_period" "moderate" "moderate" ...
##  $ require_guest_profile_picture               : chr  "f" "f" "f" "f" ...
##  $ require_guest_phone_verification            : chr  "f" "t" "f" "f" ...
##  $ calculated_host_listings_count              : int  3 2 3 3 4 4 4 4 1 3 ...
##  $ calculated_host_listings_count_entire_homes : int  1 2 1 1 0 0 0 0 0 3 ...
##  $ calculated_host_listings_count_private_rooms: int  2 0 2 2 4 4 4 4 1 0 ...
##  $ calculated_host_listings_count_shared_rooms : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reviews_per_month                           : num  0.05 0.46 0.07 1.26 3.01 2.16 2.62 3.17 0.69 3.09 ...

As a complement for the main dataset, I’ll be using the neighbourhoods_geojson dataset, which provides geospatial information from barcelona and its airbnb apartments (such as longitude, latitude, polygons, etc):

library(rgdal)
neighbourhoods_geojson <- rgdal::readOGR("C:/Users/marct/OneDrive - Tecnocampus Mataro-Maresme/Documentos/CURSOS/PROJECTES/AIRBNB PRICE PREDICTION/DATA/neighbourhoods.geojson")

Data Wrangling

First of all I load the lubridate package, which I use to convert the host_since variable, from a character to a date format. Then I extract the year from the variable I have just converted and reorder the data.

library(lubridate)
airbnb_data$host_since <- ymd(airbnb_data$host_since)
airbnb_data$host_since_year <- year(airbnb_data$host_since)
airbnb_data <- airbnb_data[, c(1, ncol(airbnb_data), 2 : (ncol(airbnb_data) - 1))] 

After that, I convert the neighbourhood column into a factor and realise that it has spelling mistakes.

airbnb_data$neighbourhood <- airbnb_data$neighbourhood %>% as.factor() 
levels(airbnb_data$neighbourhood)
##  [1] ""                                 "Camp d'en Grassot i Gràcia Nova"
##  [3] "Can Baro"                         "Carmel"                          
##  [5] "Ciutat Vella"                     "Diagonal Mar - La Mar Bella"     
##  [7] "Dreta de l'Eixample"              "Eixample"                        
##  [9] "El Baix Guinardó"                "El Besòs i el Maresme"          
## [11] "El Bon Pastor"                    "El Born"                         
## [13] "El Camp de l'Arpa del Clot"       "El Clot"                         
## [15] "El Coll"                          "El Congrés i els Indians"       
## [17] "el Fort Pienc"                    "El Gòtic"                       
## [19] "El Poble-sec"                     "El Poblenou"                     
## [21] "El Putget i Farró"               "El Raval"                        
## [23] "Glòries - El Parc"               "Gràcia"                         
## [25] "Guinardó"                        "Horta"                           
## [27] "Horta-Guinardó"                  "L'Antiga Esquerra de l'Eixample" 
## [29] "La Barceloneta"                   "La Font d'en Fargues"            
## [31] "La Guineueta - Canyelles"         "La Maternitat i Sant Ramon"      
## [33] "La Nova Esquerra de l'Eixample"   "La Prosperitat"                  
## [35] "La Sagrada Família"              "La Sagrera"                      
## [37] "La Salut"                         "La Teixonera"                    
## [39] "La Trinitat Vella"                "La Vall d'Hebron"                
## [41] "La Verneda i La Pau"              "La Vila Olímpica"               
## [43] "Les Corts"                        "Les Tres Torres"                 
## [45] "Montbau"                          "Navas"                           
## [47] "Nou Barris"                       "Pedralbes"                       
## [49] "Porta"                            "Provençals del Poblenou"        
## [51] "Sant Andreu"                      "Sant Andreu de Palomar"          
## [53] "Sant Antoni"                      "Sant Genís dels Agudells"       
## [55] "Sant Gervasi - Galvany"           "Sant Gervasi - la Bonanova"      
## [57] "Sant Martí"                      "Sant Martí de Provençals"      
## [59] "Sant Pere/Santa Caterina"         "Sants-Montjuïc"                 
## [61] "Sarrià"                          "Sarrià-Sant Gervasi"            
## [63] "Torre Baró"                      "Trinitat Nova"                   
## [65] "Turó de la Peira - Can Peguera"  "Vallcarca i els Penitents"       
## [67] "Verdum - Los Roquetes"            "Vila de Gràcia"                 
## [69] "Vilapicina i la Torre Llobeta"

Since I am catalan, I am able to correct the spelling mistakes.

levels(airbnb_data$neighbourhood) <- c(""   ,                              "Camp d'en Grassot i Gràcia Nova", "Can Baró",                        
                                       "Carmel"   ,                        "Ciutat Vella",                     "Diagonal Mar - La Mar Bella" ,    
                                       "Dreta de l'Eixample"    ,          "Eixample"    ,                     "El Baix Guinardó",               
                                       "El Besós i el Maresme"   ,        "El Bon Pastor" ,                   "El Born" ,                        
                                       "El Camp de l'Arpa del Clot",       "El Clot"  ,                        "El Coll" ,                        
                                       "El Congrés i els Indians" ,       "El Fort Pienc"  ,                  "El Gòtic" ,                      
                                       "El Poble-sec"  ,                   "El Poblenou" ,                     "El Putget i Farró" ,             
                                       "El Raval"  ,                       "Glòries - El Parc" ,              "Gràcia",                         
                                       "Guinardó" ,                       "Horta"  ,                          "Horta-Guinardó"  ,               
                                       "L'Antiga Esquerra de l'Eixample",  "La Barceloneta"   ,                "La Font d'en Fargues" ,           
                                       "La Guineueta - Canyelles" ,        "La Maternitat i Sant Ramon" ,      "La Nova Esquerra de l'Eixample",  
                                       "La Prosperitat"   ,                "La Sagrada Família",              "La Sagrera"  ,                    
                                       "La Salut"    ,                     "La Teixonera",                     "La Trinitat Vella"  ,             
                                       "La Vall d'Hebron"   ,              "La Verneda i La Pau" ,             "La Vila Olímpica" ,              
                                       "Les Corts" ,                       "Les Tres Torres"   ,               "Montbau"  ,                       
                                       "Navas"   ,                         "Nou Barris"  ,                     "Pedralbes" ,                      
                                       "Porta"   ,                         "Provençals del Poblenou" ,        "Sant Andreu" ,                    
                                       "Sant Andreu de Palomar"  ,         "Sant Antoni"   ,                   "Sant Genís dels Agudells" ,      
                                       "Sant Gervasi - Galvany"  ,         "Sant Gervasi - la Bonanova",       "Sant Martí"    ,                 
                                       "Sant Martí de Provençals",       "Sant Pere/Santa Caterina",         "Sants-Montjuïc"   ,              
                                       "Sarrià "   ,                       "Sarrià -Sant Gervasi" ,            "Torre Baró"   ,                  
                                       "Trinitat Nova"  ,                  "Turó de la Peira - Can Peguera",  "Vallcarca i els Penitents" ,      
                                       "Verdum - Los Roquetes" ,           "Vila de Gràcia"     ,             "Vilapicina i la Torre Llobeta") 

This variables also need to be treated, since they have special characters such as % or $, or they have to be coded as boolean factors. Here we can see some of the values that I am about to modify:

airbnb_data[1, c(4,5)]
##   host_response_rate host_acceptance_rate
## 1               100%                  91%
airbnb_data[1, c(30, 31, 32, 34)]
##    price security_deposit cleaning_fee extra_people
## 1 $80.00          $100.00       $20.00        $0.00
airbnb_data[1, c(6, 11, 12, 20, 37, 51, 52, 53, 55, 56)]
##   host_is_superhost host_has_profile_pic host_identity_verified
## 1                 f                    t                      t
##   is_location_exact has_availability requires_license instant_bookable
## 1                 t                t                t                f
##   is_business_travel_ready require_guest_profile_picture
## 1                        f                             f
##   require_guest_phone_verification
## 1                                f

Then I treat the previous variables that had special characters such as “%” or “$” using simple loops, and convert them into numeric variables. After that I substitute the “t” and “f” values from the boolean variables with TRUE and FALSE (respetively) and I convert them into factors through the last loop.

for(i in c(4, 5)){
        airbnb_data[, i] <- as.numeric(gsub("%", "", airbnb_data[, i])) / 100
}


for(i in c(30, 31, 32, 34)){
        airbnb_data[, i] <- as.numeric(gsub("\\$", "", airbnb_data[, i])) 
}


for(i in c(6, 11, 12, 20, 37, 51, 52, 53, 55, 56)){
        airbnb_data[, i] <- gsub("t", "TRUE", airbnb_data[, i])
        airbnb_data[, i] <- gsub("f", "FALSE", airbnb_data[, i])
}

Now I create a list that I will use to spot strange values and NA’s by converting all the variables in factors and copying the levels of the ith variable into the ith element of a list. I use the “jsonedit” function from the listviewer package in order to create an interactive and easy list.

airbnb_data_factor <- airbnb_data
factor_list <- list()
for(i in 1:ncol(airbnb_data_factor)){
        airbnb_data_factor[, i] <- as.factor(airbnb_data_factor[, i])
        factor_list <- lapply(airbnb_data_factor, levels)
}

library(listviewer)
jsonedit(factor_list) 

After exploring this list I observe that the host_verifications and amenities columns need some treatment in order to be used in any statistical model (I need to create dummy variables with their values). This is an example of the format of this columns’ values.

airbnb_data$host_verifications[1]
## [1] "['email', 'phone', 'reviews', 'manual_offline', 'jumio', 'offline_government_id', 'government_id', 'work_email']"
airbnb_data$amenities[1]
## [1] "{TV,\"Cable TV\",Internet,Wifi,\"Air conditioning\",Kitchen,Elevator,Heating,\"Family/kid friendly\",Washer,Dryer,\"Smoke alarm\",\"Carbon monoxide alarm\",\"Fire extinguisher\",Essentials,\"translation missing: en.hosting_amenity_49\"}"

In the following chunk I create two dataframes, after deleting the strange or unuseful characters that I have spoted using the previous list.

library(mgsub)
library(stringr)

host_verifications <- as.data.frame(mgsub(airbnb_data$host_verifications, c("\\[", "\\]", "\\,", "\\'") , c("", "", "", "")))
amenities <- as.data.frame(mgsub(airbnb_data$amenities, c("\\{", "\\}", "\\,", "\\'", "\\[", "\\]", "\"", "\\/") , c("", "", "  ", "", "", "", "", "")))

Now I create two listes for each of the variables which I will use in the following loop. Since both variables (host_verifications and amenities) are dataframes of the same dimension, I create a loop iterating on the columns of host_verifications. This loop starts by splitting the ith row of the host_verification dataframe (and since str_split returns a list with only one element I subset the first element of each list) and saves this vector into the ith element of the output_list_verifications. Then I count the number of elements of the ith vector of the output_list_verifications list and associate this result to the ith vector of the output_list_verifications_count, which I will use later to see which is the vector with more components and create the dummy variables.

output_list_verifications <- list()
output_list_verifications_count <- list()
output_list_amenities <- list()
output_list_amenities_count <- list()
for(i in 1:nrow(host_verifications)){
        output_list_verifications[i] <- str_split(host_verifications[i, ], " ")[1]
        output_list_verifications_count[i] <- length(output_list_verifications[[i]])
        output_list_amenities[i] <- str_split(amenities[i, ], "  ")[1]
        output_list_amenities_count[i] <- length(output_list_amenities[[i]])
}

Now I extract the largest vector (the vector that contains more values) of the output_list_verifications list, by filtering the vector that has the highest value in the output_list_verifications_count list and save the result in dummy_host_verifications_cols. After that I create an empty dataframe called dummy_host_verifications_df (with the same rows than the original dataset airbnb_data and the number of columns from dummy_host_verifications_cols), and I repeat the same process for the amenities variable. Below you can see the dummy_host_verifications_df

dummy_host_verifications_cols <- output_list_verifications[[which.max(output_list_verifications_count)]] 
dummy_host_verifications_df <- matrix(nrow = nrow(airbnb_data), ncol = length(dummy_host_verifications_cols)) %>% as.data.frame()
colnames(dummy_host_verifications_df) <- dummy_host_verifications_cols

dummy_amenities_cols <- output_list_amenities[[which.max(output_list_amenities_count)]] 
dummy_amenities_df <- matrix(nrow = nrow(airbnb_data), ncol = length(dummy_amenities_cols)) %>% as.data.frame()
colnames(dummy_amenities_df) <- dummy_amenities_cols

head(dummy_host_verifications_df)
##   email phone facebook reviews manual_offline jumio offline_government_id
## 1    NA    NA       NA      NA             NA    NA                    NA
## 2    NA    NA       NA      NA             NA    NA                    NA
## 3    NA    NA       NA      NA             NA    NA                    NA
## 4    NA    NA       NA      NA             NA    NA                    NA
## 5    NA    NA       NA      NA             NA    NA                    NA
## 6    NA    NA       NA      NA             NA    NA                    NA
##   sent_id selfie government_id identity_manual work_email
## 1      NA     NA            NA              NA         NA
## 2      NA     NA            NA              NA         NA
## 3      NA     NA            NA              NA         NA
## 4      NA     NA            NA              NA         NA
## 5      NA     NA            NA              NA         NA
## 6      NA     NA            NA              NA         NA

Finally I bind the new dataframes (dummy_host_verifications_df and dummy_amenities_df) to the original dataset (airbnb_data).

airbnb_data <- cbind(airbnb_data, dummy_host_verifications_df, dummy_amenities_df) 

Now I susbtitute the NA’s with Yes (if the apartment has that verification or amenitie, and no if it doesn’t). In this case I use a loop nested inside another loop (a loop for each list/variable), where the outer loop iterates over the length of the output_list_verifications / output_list_amenities and the inner one over the dummy columns that I have just created (which I specified manually). If the name of the jth variable is in the lth vector of the output list, then the value of the ith row and jth column of the airbnb_data corresponds to “Yes” (meaning that the variable name appears in the list).

for(l in 1:length(output_list_verifications)){
        for(j in 62:74){
                if(colnames(airbnb_data)[j] %in% output_list_verifications[[l]]){
                        airbnb_data[l, j] <- "Yes"
                } else{
                        
                }
        }
}

for(l in 1:length(output_list_amenities)){
        for(j in 75:ncol(airbnb_data)){
                if(colnames(airbnb_data)[j] %in% output_list_amenities[[l]]){
                        airbnb_data[l, j] <- "Yes"
                } else{
                        
                }
        }
}

Now I convert all the new columns into factors

for(j in 62:ncol(airbnb_data)){
        for(i in 1:nrow(airbnb_data)){
                if(is.na(airbnb_data[i, j]) == TRUE){
                        airbnb_data[i, j] <- "No"
                } 
        }
        airbnb_data[, j] <- factor(airbnb_data[, j], levels = c("Yes", "No"))
}

I delete some unuseful variables, and create new ones: days_since_host, days_since_first_review, days_since_last_review, which will allow me to introduce the time component into the model, since the statistical methods I am about to use, don’t deal well with dates and times.

airbnb_data$amenities <- NULL
airbnb_data$host_verifications <- NULL
airbnb_data$has_availability <- NULL
airbnb_data$is_business_travel_ready <- NULL
airbnb_data$days_since_host <- (today() - as.Date(airbnb_data$host_since)) %>% as.numeric()
airbnb_data$days_since_first_review <- (today() - as.Date(airbnb_data$first_review)) %>% as.numeric()
airbnb_data$days_since_last_review <- (today() - as.Date(airbnb_data$last_review)) %>% as.numeric()
airbnb_data$host_since <- NULL
airbnb_data$first_review <- NULL
airbnb_data$last_review <- NULL
airbnb_data$host_response_time <- NULL
airbnb_data$host_listings_count <- NULL

Now I delete the NA’s from these columns, which I will use to clean the other variables by substituting NA’s and strange values with the mean (in case of numerical variables), or the mode (in case of categorical ones).

airbnb_data <- airbnb_data[complete.cases(airbnb_data$host_since_year) & complete.cases(airbnb_data$country) & complete.cases(airbnb_data$city), ] 

I use the same list as before to spot NA’s, missing values and strange values:

airbnb_data_factor <- airbnb_data
for(i in 1:ncol(airbnb_data_factor)){
        airbnb_data_factor[, i] <- as.factor(airbnb_data_factor[, i])
        factor_list <- lapply(airbnb_data_factor, levels)
}

library(listviewer)
jsonedit(factor_list)

The strange values are : "“,”N/A“, NA,”-“,”*“,”.“,”[no name]“,”.". Firstly, I create a for nested in another for, which iterates over columns (j), rows(i) and patterns (k). If it detects that a value that is not N/A from the ith row and jth column, contains any of the k patterns, then it converts that value in NA. This conversion will come in handy in the following chunk codes.

patterns <- c("N/A", "-", "*", "[no name]", ".", "")

for(j in 1:ncol(airbnb_data)){
        for(i in 1:nrow(airbnb_data)){
                for(k in 1:length(patterns)){
                        if(airbnb_data[i, j] == patterns[k] & !is.na(airbnb_data[i, j])){
                                airbnb_data[i, j] <- NA
                        }
                }
        }
}

I load the modeest package in order to be able to apply the mlv function (which calculates the mode of a vector) and also the stringr library, which I’ll use later. In this loop I iterate over columns (j) and rows(i). If the element from the ith row and jth column is NA, and is also a numeric value, it is substituted by the mean of the jth variable (after being filtered to the same year, country and city values of the ith row). If is NA and character or factor, then it is substituted by the mean of the jth variable (also after being filtered to the same year, country and city of the ith row).

library(modeest) 
library(stringr)

for(j in 1:ncol(airbnb_data)){
        for(i in 1:nrow(airbnb_data)){
                        if(is.na(airbnb_data[i, j]) == TRUE){
                                if(is.numeric(airbnb_data[, j]) == TRUE){
                                        airbnb_data[i, j] = median(airbnb_data[complete.cases(airbnb_data) & airbnb_data$host_since_year == airbnb_data[i, "host_since_year"]
                                                                               & airbnb_data$country == airbnb_data[i, "country"] & airbnb_data$city == airbnb_data[i, "city"] , j])
                                } else if(is.character(airbnb_data[, j]) == TRUE | is.factor(airbnb_data[, j]) == TRUE){
                                        airbnb_data[i, j] = mlv(airbnb_data[complete.cases(airbnb_data) & airbnb_data$host_since_year == airbnb_data[i, "host_since_year"]
                                                                             & airbnb_data$country == airbnb_data[i, "country"] & airbnb_data$city == airbnb_data[i, "city"], j])[1]
                        }
                }
        }
}

Then I filter out the possible NA’s left (in case there is any NA left) and substitute the " " spaces from the column names with "_", using an easy loop. Finally I delete unuseful columns (mainly factor columns containing a single level), and convert the neighbourhood variable into a factor.

airbnb_data <- airbnb_data[complete.cases(airbnb_data), ]
for(i in 1:ncol(airbnb_data)){
        colnames(airbnb_data)[i] <- gsub(" ", "_", colnames(airbnb_data)[i])
}

airbnb_data <- airbnb_data[, !duplicated(colnames(airbnb_data))] # we delete duplicated columns (because some dummy variables had very similar names and were detected by R as identical)

airbnb_data <- airbnb_data[airbnb_data$city == "Barcelona", ] # I filter out possible other cities (since there was a row containing a different city value)
airbnb_data$country <- NULL
airbnb_data$state <- NULL
airbnb_data$smart_location <- NULL
airbnb_data$city <- NULL # since there's only one city (Barcelona)
airbnb_data$requires_license <- NULL # since it is a factor with one level
airbnb_data$neighbourhood <- as.factor(airbnb_data$neighbourhood)

Exploratory Data Analysis

First of all I check that there are no NA’s left in the data:

airbnb_data[is.na(airbnb_data), ] %>% nrow()
## [1] 0

Now I use str() and summary() to check the structure of the data again:

str(airbnb_data)
## 'data.frame':    7364 obs. of  156 variables:
##  $ host_since_year                             : num  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ host_response_rate                          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ host_acceptance_rate                        : num  0.91 0.91 0.91 1 1 1 1 0.77 1 0.55 ...
##  $ host_is_superhost                           : chr  "FALSE" "FALSE" "FALSE" "TRUE" ...
##  $ host_neighbourhood                          : chr  "El Gòtic" "El Gòtic" "El Gòtic" "L'Antiga Esquerra de l'Eixample" ...
##  $ host_total_listings_count                   : int  3 3 3 4 4 4 4 1 7 7 ...
##  $ host_has_profile_pic                        : chr  "TRUE" "TRUE" "TRUE" "TRUE" ...
##  $ host_identity_verified                      : chr  "TRUE" "TRUE" "TRUE" "TRUE" ...
##  $ neighbourhood                               : Factor w/ 69 levels "","Camp d'en Grassot i Gràcia Nova",..: 18 18 5 8 28 28 28 5 8 2 ...
##  $ latitude                                    : num  41.4 41.4 41.4 41.4 41.4 ...
##  $ longitude                                   : num  2.18 2.18 2.18 2.15 2.15 ...
##  $ is_location_exact                           : chr  "TRUE" "TRUE" "TRUE" "TRUE" ...
##  $ property_type                               : chr  "Apartment" "Apartment" "Apartment" "Apartment" ...
##  $ room_type                                   : chr  "Private room" "Private room" "Entire home/apt" "Private room" ...
##  $ accommodates                                : int  2 2 9 2 1 1 2 1 4 6 ...
##  $ bathrooms                                   : num  1 1 3 2 2 2 2 1 1 2 ...
##  $ bedrooms                                    : int  1 1 4 1 1 1 1 1 1 2 ...
##  $ beds                                        : int  1 1 6 1 1 1 1 1 2 4 ...
##  $ bed_type                                    : chr  "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
##  $ square_feet                                 : num  807 807 807 807 807 807 807 807 807 807 ...
##  $ price                                       : num  80 100 227 40 30 30 45 33 130 110 ...
##  $ security_deposit                            : num  100 150 200 0 0 0 0 0 150 500 ...
##  $ cleaning_fee                                : num  20 40 67 6 6 6 6 15 59 30 ...
##  $ guests_included                             : int  2 1 4 1 1 1 1 1 2 3 ...
##  $ extra_people                                : num  0 0 25 10 0 0 10 0 10 25 ...
##  $ minimum_nights                              : int  3 5 4 7 2 2 2 2 2 3 ...
##  $ maximum_maximum_nights                      : int  90 120 1125 1125 730 1125 1125 65 364 365 ...
##  $ availability_30                             : int  16 30 30 26 14 21 17 29 30 30 ...
##  $ availability_60                             : int  16 60 60 52 31 44 43 59 60 60 ...
##  $ availability_90                             : int  16 90 90 75 46 70 69 83 80 90 ...
##  $ availability_365                            : int  88 180 180 348 318 345 344 358 336 365 ...
##  $ number_of_reviews                           : int  2 8 149 303 238 258 222 73 339 39 ...
##  $ review_scores_rating                        : num  100 68 91 94 95 96 95 94 94 88 ...
##  $ review_scores_accuracy                      : num  10 8 10 10 10 10 10 10 10 9 ...
##  $ review_scores_cleanliness                   : num  10 8 9 9 10 10 9 10 10 9 ...
##  $ review_scores_checkin                       : num  10 7 10 10 10 10 10 10 9 9 ...
##  $ review_scores_communication                 : num  10 9 10 10 10 10 10 10 10 9 ...
##  $ review_scores_value                         : num  10 7 9 10 10 9 9 10 10 9 ...
##  $ instant_bookable                            : chr  "FALSE" "FALSE" "TRUE" "TRUE" ...
##  $ cancellation_policy                         : chr  "moderate" "moderate" "moderate" "moderate" ...
##  $ require_guest_profile_picture               : chr  "FALSE" "FALSE" "FALSE" "FALSE" ...
##  $ require_guest_phone_verification            : chr  "FALSE" "FALSE" "FALSE" "FALSE" ...
##  $ calculated_host_listings_count              : int  3 3 3 4 4 4 4 1 3 4 ...
##  $ calculated_host_listings_count_entire_homes : int  1 1 1 0 0 0 0 0 3 4 ...
##  $ calculated_host_listings_count_private_rooms: int  2 2 2 4 4 4 4 1 0 0 ...
##  $ calculated_host_listings_count_shared_rooms : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reviews_per_month                           : num  0.05 0.07 1.26 3.01 2.16 2.62 3.17 0.69 3.09 0.4 ...
##  $ email                                       : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ phone                                       : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ facebook                                    : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 1 1 1 ...
##  $ reviews                                     : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ manual_offline                              : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 1 2 2 ...
##  $ jumio                                       : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ offline_government_id                       : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 2 2 2 ...
##  $ sent_id                                     : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ selfie                                      : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ government_id                               : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ identity_manual                             : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ work_email                                  : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 1 2 ...
##  $ TV                                          : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Cable_TV                                    : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 2 2 ...
##  $ Internet                                    : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Wifi                                        : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Air_conditioning                            : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 1 1 1 ...
##  $ Wheelchair_accessible                       : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Kitchen                                     : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Paid_parking_off_premises                   : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 1 2 ...
##  $ Elevator                                    : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 1 ...
##  $ Buzzerwireless_intercom                     : Factor w/ 2 levels "Yes","No": 2 2 1 1 2 2 1 1 1 1 ...
##  $ Heating                                     : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Familykid_friendly                          : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 1 1 ...
##  $ Washer                                      : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 2 1 1 ...
##  $ Smoke_alarm                                 : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 2 2 2 ...
##  $ Carbon_monoxide_alarm                       : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 2 2 2 ...
##  $ First_aid_kit                               : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
##  $ Safety_card                                 : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
##  $ Fire_extinguisher                           : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 2 2 2 ...
##  $ Essentials                                  : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Shampoo                                     : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 1 2 2 ...
##  $ Hangers                                     : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ Hair_dryer                                  : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ Iron                                        : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ Laptop-friendly_workspace                   : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 1 2 2 ...
##  $ Private_entrance                            : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
##  $ Outlet_covers                               : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ High_chair                                  : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 1 2 ...
##  $ Stair_gates                                 : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Children’s_books_and_toys                 : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Window_guards                               : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Table_corner_guards                         : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Babysitter_recommendations                  : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Crib                                        : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 1 2 ...
##  $ Pack_’n_Playtravel_crib                   : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 1 2 ...
##  $ Room-darkening_shades                       : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 1 2 2 ...
##  $ Children’s_dinnerware                     : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Hot_water                                   : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 2 1 2 ...
##  $ Body_soap                                   : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Bath_towel                                  : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Toilet_paper                                : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##   [list output truncated]
summary(airbnb_data)
##  host_since_year host_response_rate host_acceptance_rate host_is_superhost 
##  Min.   :2009    Min.   :0.0000     Min.   :0.0000       Length:7364       
##  1st Qu.:2012    1st Qu.:1.0000     1st Qu.:0.8900       Class :character  
##  Median :2013    Median :1.0000     Median :0.9800       Mode  :character  
##  Mean   :2013    Mean   :0.9348     Mean   :0.8748                         
##  3rd Qu.:2013    3rd Qu.:1.0000     3rd Qu.:1.0000                         
##  Max.   :2014    Max.   :1.0000     Max.   :1.0000                         
##                                                                            
##  host_neighbourhood host_total_listings_count host_has_profile_pic
##  Length:7364        Min.   :  0.00            Length:7364         
##  Class :character   1st Qu.:  1.00            Class :character    
##  Mode  :character   Median :  3.00            Mode  :character    
##                     Mean   : 15.68                                
##                     3rd Qu.: 17.00                                
##                     Max.   :170.00                                
##                                                                   
##  host_identity_verified             neighbourhood     latitude    
##  Length:7364            Eixample           :1607   Min.   :41.35  
##  Class :character       Ciutat Vella       : 796   1st Qu.:41.38  
##  Mode  :character       Sants-Montjuïc     : 666   Median :41.39  
##                         Dreta de l'Eixample: 400   Mean   :41.39  
##                         Gràcia             : 393   3rd Qu.:41.40  
##                         Sant Martí         : 374   Max.   :41.46  
##                         (Other)            :3128                  
##    longitude     is_location_exact  property_type       room_type        
##  Min.   :2.089   Length:7364        Length:7364        Length:7364       
##  1st Qu.:2.157   Class :character   Class :character   Class :character  
##  Median :2.167   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2.167                                                           
##  3rd Qu.:2.177                                                           
##  Max.   :2.222                                                           
##                                                                          
##   accommodates      bathrooms         bedrooms           beds       
##  Min.   : 1.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 2.000  
##  Mean   : 3.789   Mean   : 1.393   Mean   : 1.769   Mean   : 2.574  
##  3rd Qu.: 5.000   3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.: 3.000  
##  Max.   :20.000   Max.   :16.000   Max.   :16.000   Max.   :40.000  
##                                                                     
##    bed_type          square_feet         price       security_deposit
##  Length:7364        Min.   :   0.0   Min.   : 10.0   Min.   :  0.0   
##  Class :character   1st Qu.: 180.0   1st Qu.: 40.0   1st Qu.:100.0   
##  Mode  :character   Median : 431.0   Median : 70.0   Median :160.0   
##                     Mean   : 358.8   Mean   :101.5   Mean   :193.2   
##                     3rd Qu.: 538.0   3rd Qu.:120.0   3rd Qu.:250.0   
##                     Max.   :2799.0   Max.   :999.0   Max.   :999.0   
##                                                                      
##   cleaning_fee    guests_included    extra_people    minimum_nights   
##  Min.   :  0.00   Min.   :  1.000   Min.   :  0.00   Min.   :   1.00  
##  1st Qu.: 20.00   1st Qu.:  1.000   1st Qu.:  0.00   1st Qu.:   2.00  
##  Median : 40.00   Median :  1.000   Median :  0.00   Median :   3.00  
##  Mean   : 50.52   Mean   :  2.087   Mean   :  9.84   Mean   :  10.81  
##  3rd Qu.: 65.00   3rd Qu.:  2.000   3rd Qu.: 18.00   3rd Qu.:  10.00  
##  Max.   :540.00   Max.   :150.000   Max.   :268.00   Max.   :1124.00  
##                                                                       
##  maximum_maximum_nights availability_30 availability_60 availability_90
##  Min.   :1.000e+00      Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:3.000e+02      1st Qu.: 0.00   1st Qu.: 2.00   1st Qu.:13.00  
##  Median :1.125e+03      Median :26.00   Median :49.00   Median :74.00  
##  Mean   :5.840e+05      Mean   :18.25   Mean   :37.25   Mean   :57.19  
##  3rd Qu.:1.125e+03      3rd Qu.:30.00   3rd Qu.:59.00   3rd Qu.:89.00  
##  Max.   :2.147e+09      Max.   :30.00   Max.   :60.00   Max.   :90.00  
##                                                                        
##  availability_365 number_of_reviews review_scores_rating review_scores_accuracy
##  Min.   :  0.0    Min.   :  0.00    Min.   : 20.00       Min.   : 2.000        
##  1st Qu.: 89.0    1st Qu.:  1.00    1st Qu.: 90.00       1st Qu.: 9.000        
##  Median :209.0    Median : 15.00    Median : 94.00       Median :10.000        
##  Mean   :206.7    Mean   : 56.84    Mean   : 92.14       Mean   : 9.517        
##  3rd Qu.:346.0    3rd Qu.: 78.00    3rd Qu.: 96.00       3rd Qu.:10.000        
##  Max.   :365.0    Max.   :731.00    Max.   :100.00       Max.   :10.000        
##                                                                                
##  review_scores_cleanliness review_scores_checkin review_scores_communication
##  Min.   : 2.000            Min.   : 2.000        Min.   : 2.000             
##  1st Qu.: 9.000            1st Qu.:10.000        1st Qu.:10.000             
##  Median : 9.000            Median :10.000        Median :10.000             
##  Mean   : 9.335            Mean   : 9.699        Mean   : 9.695             
##  3rd Qu.:10.000            3rd Qu.:10.000        3rd Qu.:10.000             
##  Max.   :10.000            Max.   :10.000        Max.   :10.000             
##                                                                             
##  review_scores_value instant_bookable   cancellation_policy
##  Min.   : 2.000      Length:7364        Length:7364        
##  1st Qu.: 9.000      Class :character   Class :character   
##  Median : 9.000      Mode  :character   Mode  :character   
##  Mean   : 9.073                                            
##  3rd Qu.:10.000                                            
##  Max.   :10.000                                            
##                                                            
##  require_guest_profile_picture require_guest_phone_verification
##  Length:7364                   Length:7364                     
##  Class :character              Class :character                
##  Mode  :character              Mode  :character                
##                                                                
##                                                                
##                                                                
##                                                                
##  calculated_host_listings_count calculated_host_listings_count_entire_homes
##  Min.   :  1.00                 Min.   :  0.00                             
##  1st Qu.:  1.00                 1st Qu.:  0.00                             
##  Median :  3.00                 Median :  1.00                             
##  Mean   : 15.44                 Mean   : 12.86                             
##  3rd Qu.: 18.00                 3rd Qu.: 14.00                             
##  Max.   :132.00                 Max.   :132.00                             
##                                                                            
##  calculated_host_listings_count_private_rooms
##  Min.   : 0.000                              
##  1st Qu.: 0.000                              
##  Median : 0.000                              
##  Mean   : 2.227                              
##  3rd Qu.: 1.000                              
##  Max.   :81.000                              
##                                              
##  calculated_host_listings_count_shared_rooms reviews_per_month email     
##  Min.   :0.00000                             Min.   : 0.010    Yes:7180  
##  1st Qu.:0.00000                             1st Qu.: 0.340    No : 184  
##  Median :0.00000                             Median : 0.800              
##  Mean   :0.02879                             Mean   : 1.214              
##  3rd Qu.:0.00000                             3rd Qu.: 1.660              
##  Max.   :8.00000                             Max.   :25.450              
##                                                                          
##  phone      facebook   reviews    manual_offline jumio     
##  Yes:7354   Yes:1673   Yes:6945   Yes: 433       Yes:5709  
##  No :  10   No :5691   No : 419   No :6931       No :1655  
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##  offline_government_id sent_id    selfie     government_id identity_manual
##  Yes:3275              Yes: 117   Yes:2249   Yes:6076      Yes:1986       
##  No :4089              No :7247   No :5115   No :1288      No :5378       
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  work_email   TV       Cable_TV   Internet    Wifi      Air_conditioning
##  Yes:1452   Yes:   0   Yes: 908   Yes:2675   Yes:7218   Yes:4949        
##  No :5912   No :7364   No :6456   No :4689   No : 146   No :2415        
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  Wheelchair_accessible Kitchen    Paid_parking_off_premises Elevator  
##  Yes: 521              Yes:6741   Yes:2520                  Yes:4354  
##  No :6843              No : 623   No :4844                  No :3010  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##  Buzzerwireless_intercom Heating    Familykid_friendly Washer     Smoke_alarm
##  Yes:2393                Yes:6264   Yes:3631           Yes:6194   Yes:1732   
##  No :4971                No :1100   No :3733           No :1170   No :5632   
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  Carbon_monoxide_alarm First_aid_kit Safety_card Fire_extinguisher Essentials
##  Yes:1324              Yes:1617      Yes: 525    Yes:1641          Yes:6909  
##  No :6040              No :5747      No :6839    No :5723          No : 455  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  Shampoo    Hangers    Hair_dryer  Iron      Laptop-friendly_workspace
##  Yes:4480   Yes:6153   Yes:6009   Yes:5802   Yes:4424                 
##  No :2884   No :1211   No :1355   No :1562   No :2940                 
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##  Private_entrance Outlet_covers High_chair Stair_gates
##  Yes:1005         Yes:  80      Yes:1519   Yes:  20   
##  No :6359         No :7284      No :5845   No :7344   
##                                                       
##                                                       
##                                                       
##                                                       
##                                                       
##  Childrenâ\200\231s_books_and_toys Window_guards Table_corner_guards
##  Yes: 512                    Yes: 113      Yes:  22           
##  No :6852                    No :7251      No :7342           
##                                                               
##                                                               
##                                                               
##                                                               
##                                                               
##  Babysitter_recommendations  Crib      Pack_â\200\231n_Playtravel_crib
##  Yes: 319                   Yes:1756   Yes: 975                 
##  No :7045                   No :5608   No :6389                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  Room-darkening_shades Childrenâ\200\231s_dinnerware Hot_water  Body_soap  Bath_towel
##  Yes: 887              Yes: 179                Yes:5115   Yes: 159   Yes: 159  
##  No :6477              No :7185                No :2249   No :7205   No :7205  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  Toilet_paper Bed_linens Extra_pillows_and_blankets Ethernet_connection
##  Yes: 159     Yes:3544   Yes:2090                   Yes: 416           
##  No :7205     No :3820   No :5274                   No :6948           
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  Pocket_wifi Microwave  Coffee_maker Refrigerator Dishwasher
##  Yes: 469    Yes:3995   Yes:3996     Yes:4367     Yes:2066  
##  No :6895    No :3369   No :3368     No :2997     No :5298  
##                                                             
##                                                             
##                                                             
##                                                             
##                                                             
##  Dishes_and_silverware Cooking_basics  Oven      Stove      Single_level_home
##  Yes:4370              Yes:3732       Yes:3255   Yes:3268   Yes: 408         
##  No :2994              No :3632       No :4109   No :4096   No :6956         
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  Patio_or_balcony Luggage_dropoff_allowed Wide_hallways
##  Yes:2504         Yes:1778                Yes: 119     
##  No :4860         No :5586                No :7245     
##                                                        
##                                                        
##                                                        
##                                                        
##                                                        
##  No_stairs_or_steps_to_enter Wide_entrance_for_guests
##  Yes: 220                    Yes: 123                
##  No :7144                    No :7241                
##                                                      
##                                                      
##                                                      
##                                                      
##                                                      
##  Flat_path_to_guest_entrance Well-lit_path_to_entrance Extra_space_around_bed
##  Yes: 104                    Yes: 188                  Yes:  57              
##  No :7260                    No :7176                  No :7307              
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  Accessible-height_bed Accessible-height_toilet Wide_clearance_to_shower
##  Yes:  59              Yes:  31                 Yes:  10                
##  No :7305              No :7333                 No :7354                
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  _toilet    Wide_entryway Host_greets_you Waterfront Beachfront
##  Yes:  10   Yes:  45      Yes:3285        Yes: 169   Yes: 133  
##  No :7354   No :7319      No :4079        No :7195   No :7231  
##                                                                
##                                                                
##                                                                
##                                                                
##                                                                
##  Handheld_shower_head Hot_water_kettle Ceiling_fan Beach_view Rain_shower
##  Yes:  38             Yes:  68         Yes:  12    Yes:   2   Yes:  38   
##  No :7326             No :7296         No :7352    No :7362   No :7326   
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  Bidet      Heated_towel_rack Balcony    Printer    Espresso_machine
##  Yes:  26   Yes:  30          Yes:  71   Yes:   7   Yes:  85        
##  No :7338   No :7334          No :7293   No :7357   No :7279        
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  Formal_dining_area Day_bed    Convection_oven Standing_valet
##  Yes:  34           Yes:  19   Yes:  40        Yes:  12      
##  No :7330           No :7345   No :7324        No :7352      
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##  Pillow-top_mattress Memory_foam_mattress En_suite_bathroom Outdoor_seating
##  Yes:  27            Yes:  36             Yes:  44          Yes:  35       
##  No :7337            No :7328             No :7320          No :7329       
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##  Mudroom    Full_kitchen Paid_parking_on_premises Bedroom_comforts
##  Yes:   9   Yes: 184     Yes:1377                 Yes: 160        
##  No :7355   No :7180     No :5987                 No :7204        
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  Bathroom_essentials Fixed_grab_bars_for_shower Shower_chair days_since_host
##  Yes: 161            Yes:   7                   Yes:   3     Min.   :2100   
##  No :7203            No :7357                   No :7361     1st Qu.:2477   
##                                                              Median :2745   
##                                                              Mean   :2791   
##                                                              3rd Qu.:3071   
##                                                              Max.   :4212   
##                                                                             
##  days_since_first_review days_since_last_review
##  Min.   : 107            Min.   : 107.0        
##  1st Qu.: 895            1st Qu.: 205.0        
##  Median :1527            Median : 222.0        
##  Mean   :1512            Mean   : 353.1        
##  3rd Qu.:2032            3rd Qu.: 340.0        
##  Max.   :3734            Max.   :3386.0        
## 

Now I check the distribution of price, the variable that I will try to predict. I use the log of price, since it allows us to see the data in a more compact way. The shape seems pretty gaussian, but a little bit skewed to the left. The Average price is 101.5, the minimum 10 and the maximum about 1000 (all the prices are in USD Dollars)

ggplot(data = airbnb_data, aes(x = log(price))) + 
        geom_histogram(binwidth = 0.5, color = "black", fill = "skyblue") + 
        ggtitle("AirBnb Price per Night (Log scale)") + 
        theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
              axis.title = element_text(hjust = 0.5, size = 12),
              axis.text = element_text(size = 12)) + 
        xlab("Log of Price per Night") + ylab("Number of apartments")

Categorical variables

In this section I will be exploring some categorical variables, as well as their relationship with price, the variable I am intended to predict. First I aggregate the data to take the average night price by neighbourhood. Then I subtract the first row, since the value is clearly an outlier with an average price of $800 per night.

night_neighbourhood <- airbnb_data %>% group_by(neighbourhood) %>% summarize(avg_night_price = mean(price)) %>% arrange(desc(avg_night_price))
night_neighbourhood <- night_neighbourhood[-1, ] # first row is clearly an outlier

Now I plot the average price per night of the 10 most expensive (up) and the 10 cheapest (down) neighbourhoods in Barcelona

ggplot(night_neighbourhood[1:10, ], aes(x = reorder(neighbourhood, avg_night_price),  y = avg_night_price, 
                                        fill = avg_night_price)) + 
        geom_bar(stat = "identity") + ggtitle("Top 10 most expensive neighbourhoods") + 
        theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
              axis.title = element_text(hjust = 0.5, size = 12),
              axis.text = element_text(size = 12)) + scale_y_continuous(labels = function(x) paste0("$", x)) +
        xlab("") + ylab("Airbnb Price per Night") + coord_flip() + theme(legend.position = "none") + scale_fill_gradient(low = "yellow", high = "red")

ggplot(night_neighbourhood[(nrow(night_neighbourhood) - 10) : nrow(night_neighbourhood), ], aes(x = reorder(neighbourhood, avg_night_price),  y = avg_night_price, 
                                        fill = avg_night_price)) + 
        geom_bar(stat = "identity") + ggtitle("Top 10 cheapest neighbourhoods") + 
        theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
              axis.title = element_text(hjust = 0.5, size = 12),
              axis.text = element_text(size = 12)) + scale_y_continuous(labels = function(x) paste0(x, "$")) +
        xlab("Neighbourhood") + ylab("Airbnb Price per Night") + coord_flip() + theme(legend.position = "none") + scale_fill_gradient(low = "green",  high = "blue")

In order to see all this insights clearly, I create a map, using the geospatial data from Barcelona, which I will obtain from the neighbourhoods_geojson dataset. This geospatial data is loaded as a “SpatialPolygonsDataFrame”, which is a kind of dataframe that includes polygons, which are used to create maps in R. This dataset includes a dataframe containing the neighbourhoods, as well as the spatial polygons:

summary(neighbourhoods_geojson) 
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##        min      max
## x  2.05247  2.22922
## y 41.31696 41.46828
## Is projected: FALSE 
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Data attributes:
##  neighbourhood      neighbourhood_group
##  Length:75          Length:75          
##  Class :character   Class :character   
##  Mode  :character   Mode  :character

Now I correct the misspellings and convert the variable neighbourhood of the dataframe data of neighbourhoods_geojson into a factor. Then I join this data with the night_neighbourhood dataframe I created previously, and use a loop to substitute the missing values of the price variable, with the mean of the available ones. This way, I obtain a dataset with each neighbourhood, its spatial polygons and its average night price.

neighbourhoods_geojson@data$neighbourhood <- as.factor(neighbourhoods_geojson@data$neighbourhood)
levels(neighbourhoods_geojson$neighbourhood) <- c("Baró de Viver" ,                               "Can Baró" ,                                   
                                                  "Can Peguera" ,                                  "Canyelles",                                    
                                                  "Ciutat Meridiana"   ,                           "Diagonal Mar i el Front Marítim del Poblenou",
                                                  "El Baix Guinardó"  ,                           "El Barri Gòtic"    ,                          
                                                  "El Besós i el Maresme",                        "El Bon Pastor"   ,                             
                                                  "El Camp d'en Grassot i Gràcia Nova" ,          "El Camp de l'Arpa del Clot"  ,                 
                                                  "El Carmel"       ,                              "El Clot"  ,                                    
                                                  "El Coll"   ,                                    "El Congrés i els Indians"   ,                 
                                                  "El Fort Pienc" ,                                "El Guinardó"  ,                               
                                                  "El Parc i la Llacuna del Poblenou" ,            "El Poble Sec"  ,                               
                                                  "El Poblenou"   ,                                "El Putxet i el Farró"  ,                      
                                                  "El Raval"   ,                                   "El Turó de la Peira",                         
                                                  "Horta"    ,                                     "Hostafrancs" ,                                 
                                                  "L'Antiga Esquerra de l'Eixample" ,              "La Barceloneta" ,                              
                                                  "La Bordeta"     ,                               "La Clota" ,                                    
                                                  "La Dreta de l'Eixample"  ,                      "La Font d'en Fargues"  ,                       
                                                  "La Font de la Guatlla"  ,                       "La Guineueta"    ,                             
                                                  "La Marina de Port"  ,                           "La Marina del Prat Vermell",                   
                                                  "La Maternitat i Sant Ramon" ,                   "La Nova Esquerra de l'Eixample" ,              
                                                  "La Prosperitat"  ,                              "La Sagrada Família" ,                         
                                                  "La Sagrera"   ,                                 "La Salut"   ,                                  
                                                  "La Teixonera"   ,                               "La Trinitat Nova" ,                            
                                                  "La Trinitat Vella"  ,                           "La Vall d'Hebron" ,                            
                                                  "La Verneda i la Pau"  ,                         "La Vila de Gràcia",                           
                                                  "La Vila Olímpica del Poblenou",                "Les Corts"    ,                                
                                                  "Les Roquetes"  ,                                "Les Tres Torres",                              
                                                  "Montbau" ,                                      "Navas",                                        
                                                  "Pedralbes"   ,                                  "Porta"  ,                                      
                                                  "Provençals del Poblenou"  ,                    "Sant Andreu"  ,                                
                                                  "Sant Antoni"   ,                                "Sant Genís dels Agudells"  ,                  
                                                  "Sant Gervasi - Galvany"  ,                      "Sant Gervasi - la Bonanova"  ,                 
                                                  "Sant Martí de Provençals"  ,                  "Sant Pere, Santa Caterina i la Ribera",        
                                                  "Sants"       ,                                  "Sants - Badal"    ,                            
                                                  "Sarrià "    ,                                   "Torre Baró" ,                                 
                                                  "Vallbona"    ,                                  "Vallcarca i els Penitents" ,                   
                                                  "Vallvidrera, el Tibidabo i les Planes"   ,      "Verdun"  ,                                     
                                                  "Vilapicina i la Torre Llobeta"  )         

neighbourhoods_geojson@data <- left_join(neighbourhoods_geojson@data, night_neighbourhood[, 1:2]) %>% as.data.frame()

for(i in 1:nrow(neighbourhoods_geojson@data)){
        if(is.na(neighbourhoods_geojson@data[i, "avg_night_price"]) == TRUE){
                neighbourhoods_geojson@data[i, 3] <- mean(neighbourhoods_geojson@data[complete.cases(neighbourhoods_geojson@data), 3])
        } else if(is.na(neighbourhoods_geojson@data[i, "avg_night_price"]) == FALSE){
        }
}

I load the leaflet package to create an interactive map. This map shows the Average Airbnb Night Price by Neighbourhood in Barcelona.

library(leaflet)
pal <- colorNumeric(
        palette = "YlGnBu",
        domain = neighbourhoods_geojson@data$avg_night_price
)


price_per_neighbourhood <- leaflet(neighbourhoods_geojson) %>%
        addTiles() %>% setView(lng = 2.1734, lat = 41.3851, zoom = 11.5) %>%
        addPolygons(stroke = TRUE, fillColor = ~ pal(avg_night_price), fillOpacity = 0.8,
                    highlight = highlightOptions(weight = 2,
                                                 color = ~ pal(avg_night_price), 
                                                 fillOpacity = 1,
                                                 bringToFront = TRUE),
                    label = ~neighbourhood,
                    smoothFactor = 0.2,
                    popup = ~ paste(paste(neighbourhood,":"), "<br/>","<b/>", paste("Avg Night Price: ", "$", round(avg_night_price)))) %>%
        addLegend("bottomright", pal = pal, values = ~avg_night_price, opacity = 1.0, 
                  title = "Average Airbnb Night Price",
                  labFormat = labelFormat(prefix = "$"), na.label="")
price_per_neighbourhood

As we can see, the most expensive neighbourhoods of Barcelona seem to be in the northwest and center parts of the city, and the cheapeast ones are most likely located in the northeast and south. This may be due to the presence of turistic attractions, mainly in the center (Plaça Catalunya, Portal de l’Àngel, Passeig de Gràcia) and others such as Sagrada Familia, where the average night price is $125.

Now I create another map, this time based on the type of property:

pal3 <- colorFactor(palette = c(
        "dodgerblue2", "#E31A1C", 
        "green4",
        "#6A3D9A", 
        "#FF7F00", 
        "black", "gold1",
        "skyblue2", "#FB9A99", 
        "palegreen2",
        "#CAB2D6", 
        "#FDBF6F", 
        "gray70", "khaki2",
        "maroon", "orchid1", "deeppink1", "blue1", "steelblue4",
        "darkturquoise", "green1"), 
        domain = airbnb_data$property_type)


typeofproperty <- list()
for (i in 1:length(levels(as.factor(airbnb_data$property_type)))) {
        typeofproperty[[i]] <- airbnb_data %>% dplyr::filter(property_type == levels(as.factor(airbnb_data$property_type))[i])
}
names(typeofproperty) <- levels(as.factor(airbnb_data$property_type))

typeofproperty_map <- leaflet() %>% addTiles() %>% setView(lng = 2.1734, lat = 41.3851, zoom = 13)

for (i in 1:length(levels(as.factor(airbnb_data$property_type)))) {
        typeofproperty_map <- typeofproperty_map %>% addCircles(data = typeofproperty[[i]], lat = ~latitude, 
                                          lng = ~longitude, color = ~pal3(property_type), 
                                          fillOpacity = 1, label = ~property_type, 
                                          popup = ~price, group = levels(as.factor(airbnb_data$property_type))[i])
}

typeofproperty_map <- typeofproperty_map %>% addLegend(data = airbnb_data, "topleft", 
                                                         pal = pal3, values = ~property_type, title = "Property Type", 
                                                         opacity = 1, group = "Legend")

groups <- c("Legend", levels(as.factor(airbnb_data$property_type)))
typeofproperty_map <- typeofproperty_map %>% addLayersControl(overlayGroups = groups, 
                                                              options = layersControlOptions(collapsed = TRUE))
typeofproperty_map

Apartments are the most common Airbnb Property Types, being spread all over the city, condominiums lofts and aparthotels are also pretty extended.

Numerical variables

Now I check the correlations of price and the other numerical variables.

library(corrr)

numeric_correlations <- select_if(airbnb_data, is.numeric) %>% correlate() %>% focus(price)

numeric_correlations_plot <- ggplot(data = numeric_correlations, aes(x = reorder(rowname, price), y = price)) + 
        geom_bar(aes(fill = price), stat = "identity") + 
        scale_fill_gradient(low = "blue", high = "red") +
        ggtitle("Correlation of price and the other numeric variables") +
        theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
              axis.title = element_text(hjust = 0.5, size = 14),
              axis.text.x = element_text(size = 14, angle = 90),
              legend.position = "none",
              legend.title = element_blank()) + 
        xlab("Numeric variable") + ylab("Correlation") + coord_flip()
numeric_correlations_plot

The variables that seem to have more dependency on price are: accomodates, bedrooms, beds and guests_included. This means that these factors may affect the price more than other variables.

Modeling

In this problem I am going to be trying 6 different Statistical Learning Models: Ridge Regression, Lasso Regression, Bagging, Random Forest, Boosting and an Ensemble Model created averaging the other models. In order to apply all this models, I’ll be using a training/testing partition approach: I’ll divide the existing data in two partitions: 60% for the training data, where I’ll train and optimize the model and 40% for the testing, where I’ll analyze the performance of each method.

set.seed(1)
train <- sample(1:nrow(airbnb_data), nrow(airbnb_data)/2)
test <- -train

Ridge Regression

Ridge regression is a type of regularized linear regression, which includes a squared penalty, multiplied by a shrinkage parameter lambda, which has to be chosen by cross-validation. It is basically a regression problem where the goal is to find the coefficients (betas) to optimize a penalized Residual Sum of Squares Formula (as seen in the following chunk):

The coefficients are those that minimize the previous formula: Source: The Elements of Statistical Learning by Robert Hastie, Trevor Tibshirani and Jerome Friedman

In matrix notation this is equivalent to: Source: The Elements of Statistical Learning by Robert Hastie, Trevor Tibshirani and Jerome Friedman

Here I define x as a matrix (-1 because we exclude the intercept) and y as a vector of prices of airbnb. Then I perform cross-validation to find the optimal value of lambda.

library(glmnet)
x <- model.matrix(price ~., airbnb_data)[, -1]
y <- airbnb_data$price

cv.out_ridge <- cv.glmnet(x[train, ], y[train], alpha = 0)
cv.out_ridge$lambda.min
## [1] 226.1069
par(mfrow = c(1,1))
plot(cv.out_ridge, main = "10-CV to choose Best Lambda") 

Here we can see the different values of lambda (the shrinkage penalty) and their related MSE.

Now we create a model with the best lambda, make the predictions on the test set, and compare the results to the test result using the RMSE (Root Mean Squared Error), as the squared root of the mean of the sum of all the predicted values minus the test values all squared. Here we can see the best lambda and the rmse

bestlam_ridge <- cv.out_ridge$lambda.min
paste("The Best Lambda is: ", bestlam_ridge)
## [1] "The Best Lambda is:  226.106905212365"
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = bestlam_ridge)
ridge.pred <- predict(ridge.mod, s = bestlam_ridge, newx = x[test,])
rmse_ridge <- sqrt(mean((ridge.pred - y[test]) ^ 2 ))
paste("RMSE: ", rmse_ridge)
## [1] "RMSE:  86.2334121237272"

Now we check how well the model fitted the data visually, by plotting the test values versus the predicted values. As we can see there is kind of a correlation between the test and the predicted values, and it is not bad as a starting model.

ridge_prediction <- cbind(ridge.pred, y[test]) %>% as.data.frame()
colnames(ridge_prediction) <- c("Prediction", "Test_value")

ridge_accuracy_plot <- ggplot(data = ridge_prediction, aes(x = Test_value, y = Prediction)) + 
        geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
        ggtitle("Ridge Regression: Prediction Accuracy") +
        theme(plot.title = element_text(hjust = 0.5, face = "bold"),
              axis.title = element_text(hjust = 0.5),
              legend.position = "none",
              legend.title = element_blank()) + 
        xlab("Test Value") + ylab("Predicted Value") 
ridge_accuracy_plot

Lasso Regression

Lasso regression is a type of regularized linear regression, which includes an absolute value penalty, multiplied by a shrinkage parameter lambda, which has to be chosen by cross-validation. Source: The Elements of Statistical Learning by Robert Hastie, Trevor Tibshirani and Jerome Friedman

Here I perform cross-validation to find the optimal value of lambda. In the plot we can see that the lambda that minimizes the MSE might be between exp(-2) and exp(0) = 1.

cv.out_lasso <- cv.glmnet(x[train, ], y[train], alpha = 1)

par(mfrow = c(1,1))
plot(cv.out_lasso, main = "10-CV to choose Best Lambda") # ens dona el menor valor de lambda pel cual

Here we can see the different values of lambda (the shrinkage penalty) and their related MSE.

Now I perform a lasso regression using the best lambda in the function glmnet(). The model improves the performance of the ridge, since it has lower RMSE.

bestlam_lasso <- cv.out_lasso$lambda.min
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = bestlam_lasso)
lasso.pred <- predict(lasso.mod, s = bestlam_lasso, newx = x[test,])
rmse_lasso <- sqrt(mean((lasso.pred - y[test]) ^ 2 )) 
paste("The Best Lambda is: ", bestlam_lasso)
## [1] "The Best Lambda is:  1.32431284875504"

Now I plot the test values versus the predicted values, and I observe a similar pattern than in the ridge, there is a correlation, but it isn’t very strong yet.

lasso_prediction <- cbind(lasso.pred, y[test]) %>% as.data.frame()
colnames(lasso_prediction) <- c("Prediction", "Test_value")

lasso_accuracy_plot <- ggplot(data = lasso_prediction, aes(x = Test_value, y = Prediction)) + 
        geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
        ggtitle("Lasso Regression: Prediction Accuracy") +
        theme(plot.title = element_text(hjust = 0.5, face = "bold"),
              axis.title = element_text(hjust = 0.5),
              legend.position = "none",
              legend.title = element_blank()) + 
        xlab("Test Value") + ylab("Predicted Value")
lasso_accuracy_plot

Bagging

Bagging stands for Bootstraped Aggregation, and basically averages the predictions obtained from n regression trees (each regression tree is trained in a different bootstraped subset of the data, but without restrictions in the number of predictors considered at each split step). The formula behind this algorithm is the following one: Source: The Elements of Statistical Learning, 2nd Edition by Robert Hastie, Trevor Tibshirani and Jerome Friedman

Before using any function, I have to make sure the data is well spelled, since the function I am going to use doesn’t accept empty spaces or numbers in the column names.

colnames(airbnb_data)[c(28, 29, 30, 31)] <- c("availability_onemonth", "availability_twomonths", "availability_threemonths", "availability_oneyear")
colnames(airbnb_data)[c(88, 93, 95, 124)] <- c("Childrens_books_and_toys", "Pack_Playtravel_crib", "Childrens_dinnerware", "Toilet")

for(i in 1 : ncol(airbnb_data)){
        colnames(airbnb_data)[i] <- gsub("-", "_", colnames(airbnb_data)[i])
}

In order to use Tree methods, R doesn’t allow to include in the model variables with more than 30 factors, that’s why I’ll have to exclude some columns during this process. The Bagging algorithm basically creates lots of regression trees, and predicts the average of all trees. The standard number of trees used is more or less 1000. In the following chunk I train the model, and evaluate the errors.

library(randomForest)
bagging_airbnb <- randomForest::randomForest(price ~., data = select(airbnb_data, -host_neighbourhood & - neighbourhood), 
                                                  subset = train, mtry = ncol(airbnb_data) - 1, keep.forest = T, ntree = 1000) 
yhat.bagging <- predict(bagging_airbnb, newdata = airbnb_data[test, ])
rmse_bagging <- sqrt(mean((airbnb_data$price[test] - yhat.bagging)^2))
plot_bag_data <- data.frame(test_value = airbnb_data$price[test], bag.pred = yhat.bagging)

Now I plot the predicted values vs the test values, in order to see how well is the model performing.

bag_accuracy_plot <- ggplot(data = plot_bag_data, aes(x = test_value, y = bag.pred, color = test_value)) + 
        geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
        ggtitle("Bagging: Prediction Accuracy") +
        theme(plot.title = element_text(hjust = 0.5, face = "bold"),
              axis.title = element_text(hjust = 0.5),
              legend.position = "none",
              legend.title = element_blank()) + 
        xlab("Test Value") + ylab("Predicted Value") 
bag_accuracy_plot

The model seems pretty good, since there is a relationship between predicted and test values, but it could be much more better.

Random Forest

I use the same methodology than in the bagging implementation. This model looks even more robust than the previous due to the stronger linear relationship between predicted and the real values. Random Forest is basically the same as Bagging, but with the difference that at each step, the ith regression tree can only consider a random subset of variables.

rf_airbnb_sqrtpred <- randomForest::randomForest(price ~., data = select(airbnb_data, -host_neighbourhood & - neighbourhood), 
                                             subset = train, mtry = sqrt(ncol(airbnb_data) - 1), keep.forest = T, ntree = 1000) 
yhat.rf_airbnb_sqrtpred <- predict(rf_airbnb_sqrtpred, newdata = airbnb_data[test, ])
rmse_rf_sqrtpred <- sqrt(mean((airbnb_data$price[test] - yhat.rf_airbnb_sqrtpred)^2))
plot_rfsqrt_data <- data.frame(test_value = airbnb_data$price[test], rfsqrt.pred = yhat.rf_airbnb_sqrtpred)

rfsqrt_accuracy_plot <- ggplot(data = plot_rfsqrt_data, aes(x = test_value, y = rfsqrt.pred)) + 
        geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
        ggtitle("Random Forest: Prediction Accuracy") +
        theme(plot.title = element_text(hjust = 0.5, face = "bold"),
              axis.title = element_text(hjust = 0.5),
              legend.position = "none",
              legend.title = element_blank()) + 
        xlab("Test Value") + ylab("Predicted Value") 
rfsqrt_accuracy_plot

This model looks even more robust than the previous due to the stronger linear relationship between predicted and the real values.

Gradient Boosting

Gradient boosting is also the result of combining n regression trees. It starts by predicting the variable with its mean, and then constructs n trees, predicting the residuals (the difference between price and its predicted value). This way each regression tree takes into account the errors from the previous one, creating a very robust model. The mathematical description of the Gradient Boosting algorithm is the following one: Source: The Elements of Statistical Learning by Robert Hastie, Trevor Tibshirani and Jerome Friedman

In order to compute this algorithm, firsly I convert all characters into factors, since we can’t use characters in the formula.

for(j in 1:ncol(airbnb_data)){
        if(is.character(airbnb_data[, j]) == TRUE){
                airbnb_data[, j] <- as.factor(airbnb_data[, j])
        }
}

I apply the algorithm following a similar procedure to the previous models, and obtain the following plot:

library(gbm)
gbm_airbnb_data <- gbm(price ~., data = airbnb_data[train, ], distribution = "gaussian", n.trees = 1000)
yhat.gboosting <- predict(gbm_airbnb_data, newdata = airbnb_data[test, ])
rmse_gboosting <- sqrt(mean((airbnb_data$price[test] - yhat.gboosting)^2))
plot_gboosting_data <- data.frame(test_value = airbnb_data$price[test], gboosting.pred = yhat.gboosting)
gboosting_accuracy_plot <- ggplot(data = plot_gboosting_data, aes(x = test_value, y = gboosting.pred, color = test_value)) + 
        geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
        ggtitle("Gradient Boosting: Prediction Accuracy") +
        theme(plot.title = element_text(hjust = 0.5, face = "bold"),
              axis.title = element_text(hjust = 0.5),
              legend.position = "none",
              legend.title = element_blank()) + 
        xlab("Test Value") + ylab("Predicted Value") 
gboosting_accuracy_plot

The model seems to be performing worse than the last two, but better than the ridge and the lasso.

Ensemble Model

I create an Ensemble Model combining the Ridge, Lasso, Bagging, Random Forest and Boosting results.

ensemble_prediction <- ridge.pred + lasso.pred + yhat.bagging + yhat.rf_airbnb_sqrtpred + yhat.gboosting / 5
ensemble_rmse <- sqrt(mean((ensemble_prediction - airbnb_data$price[test]) ^2))
ensemble_data <- data.frame(ensemble = ensemble_prediction, testvalue = airbnb_data$price[test])
ensemble_accuracy_plot <- ggplot(data = ensemble_data, aes(x = testvalue, y = ensemble_prediction, color = test_value)) + 
        geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
        ggtitle("Ensemble Model: Prediction Accuracy") +
        theme(plot.title = element_text(hjust = 0.5, face = "bold"),
              axis.title = element_text(hjust = 0.5),
              legend.position = "none",
              legend.title = element_blank()) + 
        xlab("Test Value") + ylab("Predicted Value")
ensemble_accuracy_plot

It performs very poorly, it is the worst method of all.

Model Comparison & final decision

I create a dataframe storing all the RMSE results:

rmse_comparison <- rbind(ridge = round(rmse_ridge), 
                              lasso = round(rmse_lasso),
                              bagging = round(rmse_bagging), 
                              randomforest = round(rmse_rf_sqrtpred), 
                              gboosting = round(rmse_gboosting), 
                         ensemble = round(ensemble_rmse)) %>% as.data.frame()

colnames(rmse_comparison)[1] <- "RMSE"
rmse_comparison$method <- rownames(rmse_comparison)
rownames(rmse_comparison) <- 1:nrow(rmse_comparison)
rmse_comparison <- rmse_comparison[, c(2, 1)]

We can observe that the best model by far is the Random Forest, followed by the Bagging. It is curious that the Random Forest has outperformed the Boosting, since it is a more sophisticated learning method. Therefore chosen model is the Random Forest!

library(ggpubr)
plot_rmse_comparison <- ggplot(data = rmse_comparison, aes(x = reorder(method, desc(RMSE)), y = RMSE, fill = method)) + 
        geom_bar(stat = "identity") + xlab("Statistical Learning Method") + 
        geom_text(aes(label = RMSE), position = position_dodge(width=0.9), vjust=-0.25) +
        ggtitle("Root Mean Squared Error (by Method)") +
        theme(plot.title = element_text(hjust = 0.5, face = "bold"),
              axis.title = element_text(hjust = 0.5),
              legend.position = "none",
              legend.title = element_blank())
plot_final <- ggarrange(ensemble_accuracy_plot + ggtitle("Ensemble Model"),
                        ridge_accuracy_plot + ggtitle("Ridge Regression"), lasso_accuracy_plot + ggtitle("Lasso Regression"),
                        gboosting_accuracy_plot + ggtitle("Gradient Boosting"),
                        bag_accuracy_plot + ggtitle("Bagging"),
                        rfsqrt_accuracy_plot + ggtitle("Random Forest"))
plot_rmse_comparison

plot_final

THANKS FOR YOUR ATTENTION :)